The purpose of this segment is to share some of our favorite tools for working with data in R.
here: no more getting lost in file paths
Use this package when… all the time. Make it a part of your regular coding routine. It’s that good.
As we introduced in the previous module, the here package is well worth getting to know because it lets you use relative rather than absolute file paths. This simplifies importing and exporting files, as well as sharing your project with others.
install.packages("here")
library(here)
## here() starts at C:/Users/sbrei/Documents/R_Projects/Collabs/BGSS_Retreat_2021
here::here() # once the package is loaded with library(here), call this function to check where your project root is
## [1] "C:/Users/sbrei/Documents/R_Projects/Collabs/BGSS_Retreat_2021"
Artwork by Allison Horst. setwd(), be gone!
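As a minimal sketch of how this plays out in practice, the snippet below builds a file path relative to the project root. The folder name data and the file name measurements.csv are hypothetical placeholders, not files from this workshop.

```r
library(here)

# Build a path relative to the project root.
# "data" and "measurements.csv" are hypothetical names used for illustration.
csv_path <- here("data", "measurements.csv")
csv_path

# Only try to read the file if it actually exists on this machine.
if (file.exists(csv_path)) {
  measurements <- read.csv(csv_path)
}
```

Because the path is assembled from the project root, the same line should keep working for a collaborator who opens the project from a different folder or operating system.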
magrittr: these pipes will make your work flow
Use this package when…
install.packages("magrittr")
library(magrittr)
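As a small illustration of what the pipe does: %>% passes the value on its left as the first argument of the function on its right, so nested calls can be read from left to right instead. The vector x is made up for this sketch.

```r
library(magrittr)

x <- c(4, 9, 16)  # illustrative values only

# Nested calls are read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# The same steps with pipes are read from left to right.
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both versions compute the same value; the piped form simply lists the steps in the order they happen.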
patchwork: make your figures nice and cozy
Use this package when…
install.packages("patchwork")
library(patchwork)
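A rough sketch of how patchwork arranges figures, assuming ggplot2 is also installed. The three plots use the built-in mtcars dataset and are only placeholders for your own figures.

```r
library(ggplot2)
library(patchwork)

# Three placeholder plots built from the mtcars dataset that ships with R.
p_scatter <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p_hist <- ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 10)
p_box <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# `|` places plots side by side, `/` stacks them on top of each other.
# plot_annotation(tag_levels = "A") labels the panels A, B, C.
combined <- (p_scatter | p_hist) / p_box +
  plot_annotation(tag_levels = "A")
combined
```

The combined object can be passed to ggsave() like any other ggplot.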
performance: evaluate your general linear models in a flash
Use this package when…
install.packages("performance")
library(performance)
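A minimal sketch of the kind of quick checks the package offers, using a simple linear model fitted to the built-in mtcars dataset. The model formula is chosen purely for illustration.

```r
library(performance)

# Fit a simple linear model on a built-in dataset (illustrative only).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One-row summary of common fit indices (AIC, BIC, R2, RMSE, ...).
fit_perf <- model_performance(fit)
fit_perf

# R-squared on its own.
r2(fit)

# Variance inflation factors to check for collinearity among predictors.
check_collinearity(fit)

# check_model(fit) draws a panel of diagnostic plots.
# It relies on the 'see' package for plotting, so it is left commented out here.
# check_model(fit)
```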
annotater: finally remember why you loaded all those packages!
Use this package when… you forget why you loaded packages at the top of your R script/notebook/markdown file OR you want to clarify why you did so for collaborators. (+1 point for reproducible science!)
It happens: you start your R file with a list of packages to be loaded with library() calls. You constantly add to it, listing more packages whose functions you use to complete your analysis. Over time, you figure out that your advisors would prefer if you didn’t use the Wes Anderson color palette, or that you’re better off creating figures with patchwork vs cowplot (sorry, cowplot). So, if you’re anything like me, after learning stats and R simultaneously while doing your first thesis project, you end up with a very impressive list of R packages, half of which you can’t remember why you loaded in the first place.
Have no fear! This is where the genius of the annotater package comes in to save us (and those who try to read our code, bless them)!
install.packages("remotes")
remotes::install_github("luisDVA/annotater")
After you’ve installed remotes and annotater, save your R files, close RStudio, and reopen it.
Click anywhere in the Source pane (aka the one with your R files).
Move your cursor to the “Addins” button in the toolbar just below the File, Edit, Code, View menu bar. Click it and select “Annotate package calls in active file”. Voilà!
Gif by Luis Verde Arregoitia
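To give a feel for the result, the sketch below imitates the idea in base R: it appends each installed package's title to its library() call as a comment. This is not annotater's own code; annotate_sketch is a hypothetical helper written only for this illustration, and the addin's exact output may differ.

```r
# Hypothetical helper (not part of annotater): add the package title
# from the DESCRIPTION file as a trailing comment on each library() call.
annotate_sketch <- function(lines) {
  vapply(lines, function(line) {
    m <- regmatches(line, regexec("^[[:space:]]*library\\(([A-Za-z0-9.]+)\\)", line))[[1]]
    if (length(m) < 2) return(line)  # not a library() call: leave untouched
    title <- suppressWarnings(utils::packageDescription(m[2], fields = "Title"))
    if (is.na(title)) return(line)   # package not installed: leave untouched
    paste0(line, " # ", title)
  }, character(1), USE.NAMES = FALSE)
}

# A tiny example script header; stats and utils ship with every R installation.
script_header <- c("library(stats)", "library(utils)", "x <- 1")
annotated <- annotate_sketch(script_header)
writeLines(annotated)
```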
dplyr: quick manipulations of your data
Use this package when… you are trying to quickly manipulate, summarize, or combine data.
Often we need to summarize data by certain categories, groups, or treatments. In other instances, we want to create a new column that contains a metric specific to our analysis. The dplyr package is built around pipes, which makes it easy to chain steps into a workflow. Additionally, dplyr handles basic data management, including selecting certain columns, renaming them, or sorting data. Many of its verbs (select, group by, join) mirror SQL operations, which provides some familiarity for those with a data or computer science background.
install.packages("dplyr")
library(dplyr)
## Load sample dataset
data(starwars)
### Calculate summary statistics for a group within the dataset.
summarizedData <- starwars %>%
  group_by(homeworld) %>% ## select variables to summarize by
  summarize(avgHeight = mean(height), nObs = length(height)) ## choose the columns to summarize and the functions to apply
summarizedData
## # A tibble: 49 x 3
## homeworld avgHeight nObs
## <chr> <dbl> <int>
## 1 Alderaan 176. 3
## 2 Aleen Minor 79 1
## 3 Bespin 175 1
## 4 Bestine IV 180 1
## 5 Cato Neimoidia 191 1
## 6 Cerea 198 1
## 7 Champala 196 1
## 8 Chandrila 150 1
## 9 Concord Dawn 183 1
## 10 Corellia 175 2
## # ... with 39 more rows
### Calculate summary statistics for multiple groups within the dataset.
summarizedData <- starwars %>%
  group_by(homeworld, species) %>% ## select variables to summarize by
  summarize(avgHeight = mean(height), nObs = length(height)) ## choose the columns to summarize and the functions to apply
summarizedData
## # A tibble: 58 x 4
## # Groups: homeworld [49]
## homeworld species avgHeight nObs
## <chr> <chr> <dbl> <int>
## 1 Alderaan Human 176. 3
## 2 Aleen Minor Aleena 79 1
## 3 Bespin Human 175 1
## 4 Bestine IV Human 180 1
## 5 Cato Neimoidia Neimodian 191 1
## 6 Cerea Cerean 198 1
## 7 Champala Chagrian 196 1
## 8 Chandrila Human 150 1
## 9 Concord Dawn Human 183 1
## 10 Corellia Human 175 2
## # ... with 48 more rows
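One thing to watch in the summaries above: mean() returns NA for any group that contains a missing height, and the two-group result is still grouped by homeworld. A possible variant, sketched below, ignores missing values and returns an ungrouped tibble. The names summarizedClean and nMeasured are introduced only for this illustration.

```r
library(dplyr)

summarizedClean <- starwars %>%
  group_by(homeworld, species) %>%
  summarize(
    avgHeight = mean(height, na.rm = TRUE),  # ignore missing heights
    nObs = n(),                              # number of rows in the group
    nMeasured = sum(!is.na(height)),         # rows with a recorded height
    .groups = "drop"                         # return an ungrouped tibble
  )
summarizedClean
```

Groups in which every height is missing get NaN for avgHeight, and nMeasured makes those cases easy to spot.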
## Create a new column of estimated BMI for each person (mass is in kg, height is in cm)
starwarsBMI <- starwars %>%
  mutate(bmi = mass / (height / 100)^2)
starwarsBMI %>% select(name, BMI = bmi) ## show calculated data and rename it to capitalize BMI
## # A tibble: 87 x 2
## name BMI
## <chr> <dbl>
## 1 Luke Skywalker 26.0
## 2 C-3PO 26.9
## 3 R2-D2 34.7
## 4 Darth Vader 33.3
## 5 Leia Organa 21.8
## 6 Owen Lars 37.9
## 7 Beru Whitesun lars 27.5
## 8 R5-D4 34.0
## 9 Biggs Darklighter 25.1
## 10 Obi-Wan Kenobi 23.2
## # ... with 77 more rows
## Find the individuals with the lowest BMI per planet
lowBMI <- starwarsBMI %>%
  group_by(homeworld) %>%
  slice(which.min(bmi))
lowBMI
## # A tibble: 40 x 15
## # Groups: homeworld [40]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 2 Ratts ~ 79 15 none grey, blue unknown NA male mascu~
## 3 Lobot 175 79 none light blue 37 male mascu~
## 4 Jek To~ 180 110 brown fair blue NA male mascu~
## 5 Nute G~ 191 90 none mottled gr~ red NA male mascu~
## 6 Ki-Adi~ 198 82 white pale yellow 92 male mascu~
## 7 Jango ~ 183 79 black tan brown 66 male mascu~
## 8 Han So~ 180 80 brown fair brown 29 male mascu~
## 9 Adi Ga~ 184 50 none dark blue NA fema~ femin~
## 10 Darth ~ 175 80 none red yellow 54 male mascu~
## # ... with 30 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>, bmi <dbl>
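slice(which.min(bmi)) works, but dplyr 1.0.0 and later also offer slice_min() for the same job. The sketch below is one possible equivalent. It filters out missing BMI values first so that planets without any usable measurement are dropped, which appears to match the behaviour of the code above. The name lowBMI_alt is introduced only for this illustration.

```r
library(dplyr)

lowBMI_alt <- starwars %>%
  mutate(bmi = mass / (height / 100)^2) %>%
  filter(!is.na(bmi)) %>%                        # drop rows where BMI cannot be computed
  group_by(homeworld) %>%
  slice_min(bmi, n = 1, with_ties = FALSE) %>%   # keep one lowest-BMI row per planet
  ungroup()
lowBMI_alt
```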
A cheatsheet for frequently used functions from dplyr can be found here
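The examples above cover summarizing and creating columns. For the sorting, renaming, and combining mentioned at the start of this section, a rough sketch might look like the following. The planetNotes lookup table and its visited column are made up for illustration and are not part of the starwars dataset.

```r
library(dplyr)

# Select, rename, and sort: the three tallest characters with a recorded height.
tallest <- starwars %>%
  filter(!is.na(height)) %>%
  select(name, height, homeworld) %>%
  rename(height_cm = height) %>%
  arrange(desc(height_cm)) %>%
  head(3)
tallest

# Hypothetical lookup table, invented for this illustration.
planetNotes <- tibble(
  homeworld = c("Tatooine", "Naboo"),
  visited = c(TRUE, FALSE)
)

# left_join() keeps every row of `tallest` and adds the matching columns
# from planetNotes; rows without a match get NA.
tallestNotes <- left_join(tallest, planetNotes, by = "homeworld")
tallestNotes
```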
tidyr: quick manipulations of your data (cont)
Use this package when… you are trying to quickly manipulate, summarize, or combine data.
The tidyr package comes from the same author as dplyr and shares the same pipe-based syntax. There are many functions in this package, but there are two that I think are extremely useful: 1) converting data between long and wide formats and 2) separating one column into several.
install.packages("tidyr")
library(tidyr)
## Load sample dataset
data(starwars)
### Select a data frame with each individual's name, species, and mass. Then convert it from long to wide format
wideMat <- starwars %>%
  select(name, species, mass) %>%
  spread(species, mass) ## one column per species, filled with mass
## convert back to long format
longMat <- wideMat %>%
  gather(species, mass, -name) ## gather every column except name
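spread() and gather() still work, but the tidyr documentation marks them as superseded by pivot_wider() and pivot_longer() (tidyr 1.0.0 and later). A rough equivalent of the round trip above might look like this; the names massWide and massLong are introduced only for this sketch.

```r
library(dplyr)
library(tidyr)

# Long to wide: one column per species, filled with mass.
massWide <- starwars %>%
  select(name, species, mass) %>%
  pivot_wider(names_from = species, values_from = mass)

# Wide back to long: stack every column except name,
# dropping the empty cells created by the widening step.
massLong <- massWide %>%
  pivot_longer(
    cols = -name,
    names_to = "species",
    values_to = "mass",
    values_drop_na = TRUE
  )
massLong
```

Note that values_drop_na = TRUE also drops characters whose mass was missing in the original data.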
## Split colour into multiple columns
starwars %>%
  separate(skin_color, sep = ", ", into = c("MainColour", "SecondaryColour", "AncillaryColour")) %>%
  select(name, MainColour, SecondaryColour, AncillaryColour)
## # A tibble: 87 x 4
## name MainColour SecondaryColour AncillaryColour
## <chr> <chr> <chr> <chr>
## 1 Luke Skywalker fair <NA> <NA>
## 2 C-3PO gold <NA> <NA>
## 3 R2-D2 white blue <NA>
## 4 Darth Vader white <NA> <NA>
## 5 Leia Organa light <NA> <NA>
## 6 Owen Lars light <NA> <NA>
## 7 Beru Whitesun lars light <NA> <NA>
## 8 R5-D4 white red <NA>
## 9 Biggs Darklighter light <NA> <NA>
## 10 Obi-Wan Kenobi fair <NA> <NA>
## # ... with 77 more rows
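separate() warns when a row has fewer pieces than columns. If you have tidyr 1.3.0 or later, separate_wider_delim() lets you state explicitly how short or long rows should be handled. The sketch below assumes that version; skinColours is a name introduced only for this illustration.

```r
library(dplyr)
library(tidyr)

skinColours <- starwars %>%
  select(name, skin_color) %>%
  separate_wider_delim(
    skin_color,
    delim = ", ",
    names = c("MainColour", "SecondaryColour", "AncillaryColour"),
    too_few = "align_start",  # fill missing pieces with NA on the right
    too_many = "merge"        # fold any extra pieces into the last column
  )
skinColours
```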
Please take 5 minutes to check out the “Feedback Page” link below and give us some feedback about today’s workshop. This will help us teach better workshops, not to mention show future employers that we might actually be sort of good at this. Thanks!